The AMARA Corpus: Building Resources for Translating the Web’s Educational Content

Authors

  • Francisco Guzman
  • Hassan Sajjad
  • Stephan Vogel
  • Ahmed Abdelali
Abstract

In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online educational content. We crawl a multilingual collection of community-generated subtitles and present the results of processing the Arabic–English portion of the data, which yields a parallel corpus of about 2.6M Arabic and 3.9M English words. We explore different approaches to align the segments, and extrinsically evaluate the resulting parallel corpus on the standard TED-talks tst-2010. We observe that the data can be successfully used for this task, and also observe an absolute improvement of 1.6 BLEU when it is used in combination with TED data. Finally, we analyze some of the specific challenges when translating the educational content.


Similar resources

The AMARA Corpus: Building Parallel Language Resources for the Educational Domain

This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multilingually aligned for 20 languages, i.e. 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich languages such as English and Arabic, and resource-poor languages such as Hindi and Thai. In this paper, we describe the gathering, validat...


Translation Evaluation in Educational Settings for Training Purposes

The following article describes different methods and techniques used in educational settings for translation evaluation. Translation evaluation is the placing of value on a translation i.e. awarding a mark, even if only a binary pass/fail one. In the present study, different features of the texts chosen for evaluation were firstly considered and then scoring the t...


Norms of Translating Taboo Words and Concepts from English into Persian after the Islamic Revolution in Iran

The research attempted to discover the norms of translating taboo words and concepts after the Islamic Revolution in Iran using Toury’s (1995) framework for classification of norms. The corpus of the study was composed of Coelho’s novels between 1990 and 2005 and their Persian translations, which were prepared and analyzed manually to discover the norms. During both the selection of novels for trans...


Leveraging Content from Open Corpus Sources for Technology Enhanced Learning

As educators attempt to incorporate the use of educational technologies in course curricula, the lack of appropriate and accessible digital content resources acts as a barrier to adoption. Quality educational digital resources can prove expensive to develop and have traditionally been restricted to use in the environment in which they were authored. As a result, educators who wish to adopt thes...


Slicepedia: Automating the Production of Educational Resources from Open Corpus Content

The World Wide Web (WWW) provides access to a vast array of digital content, a great deal of which could be ideal for incorporation into eLearning environments. However, reusing such content directly in its native form has proven to be inadequate, and manually customizing it for eLearning purposes is labor-intensive. This paper introduces Slicepedia, a service which enables the discovery, reuse...




Publication date: 2013